Abstract



We present a method to stop the evaluation of a prediction process when the result 
of the full evaluation is obvious. This trait is highly desirable in prediction tasks 
where a predictor evaluates all its features for every example in large datasets. 
We observe that some examples are easier to classify than others, a phenomenon 
which is characterized by the event when most of the features agree on the class 
of an example. By stopping the feature evaluation when encountering an easy- 
to-classify example, the predictor can achieve substantial gains in computation. 
Our method provides a natural attention mechanism for linear predictors where 
the predictor concentrates most of its computation on hard-to-classify examples 
and quickly discards easy-to-classify ones. By modifying a linear prediction algorithm
such as an SVM or AdaBoost to include our attentive method, we prove
that the average number of features computed is $O(\sqrt{n \log \delta^{-1/2}})$, where n is the
original number of features, and $\delta$ is the error rate incurred due to early stopping.
We demonstrate the effectiveness of Attentive Prediction on MNIST, Real-sim, 
Gisette, and synthetic datasets. 

1 Introduction 

We wish to avoid evaluating all the weak hypotheses for each example, so that we evaluate fewer
features for easy-to-classify examples. However, to filter an example we need to know whether it is
informative or not. We use the majority vote to measure how important an example is for learning:
the measure is proportional to the magnitude of the majority vote. When there is strong agreement
between the features, the majority vote has a larger magnitude than when the features disagree.
Our goal is to compute the fewest weak hypotheses possible before we decide whether
the majority vote will end below or above an importance threshold. Filtering out uninformative
examples and computing as few hypotheses as possible are closely related problems [2].

The intuition behind this work is prevalent in many natural decision making domains. For example, 
in finance, suppose we are interested in buying a certain stock, and we have 10 financial advisors 
at our disposal. If we sequentially ask them whether to buy the stock, and the first four consecutive 
advisors agree we should buy, we may stop and decide to buy. With low probability we might incur
an error, since the remaining six advisors might all vote in the opposite direction. If, on the other hand,
the advisors contradict each other ("yes, no, yes, no..."), we will not gain enough confidence
in either direction, and will end up asking all of them. The same holds in medicine: when a
doctor tries to diagnose whether we have a certain condition, he will put us through a sequence of
tests. If they all come out negative, he will stop and diagnose that we are healthy (with a certain
error rate); if they are all positive, he will stop and diagnose that we have the particular condition.
However, if the tests oppose each other, he will keep on testing with more refined (and probably more
expensive) tests. Finally, in face detection, the same type of attention mechanism also holds. In their
seminal work, [16] proposed an attentional cascade, where the face detector could stop the evaluation
of the classifier if it did not find any of the important features. The thinking was: if there are no eyes,
no nose, no mouth and no ears, don't look for a chin. Attention is achieved in these cases by limiting
the amount of computation time spent on examples that are easy to classify, and increasing computation
time for harder-to-classify examples.

Figure 1: Two examples are classified. The first is hard to classify, the second easy. The budgeted
learning approach would evaluate the same number of features for both examples, whereas the
stochastic approach would evaluate features according to how hard the example is to classify, while
maintaining an average budget of $O(\sqrt{n})$ features.


All these decision making problems fall under the same roof of an underlying attention mechanism
that stops computation when the result of the partial evaluation will, with high likelihood, match
the result of the full feature evaluation, i.e. an obvious positive or negative example. We argue that
by thresholding the partial evaluation of all votes at each point with a constant, and predicting the
label once the partial evaluation first hits the constant stopping threshold, we obtain an attention
mechanism that is similar in nature to the reasoning behind these three examples.

We propose to stop the computation of feature evaluations early for these examples by connecting
Sequential Analysis [18, 9] and Brownian motion analysis to margin based learning algorithms.

We use the terms margin and full margin to describe the summation of all the feature evaluations, 
and partial margin as the summation of a part of the feature evaluations. The calculation of the 
margin is broken up for each example in the dataset. This break-up allows the algorithm to make 
a decision after the evaluation of each feature whether the next feature should also be evaluated 
or the feature evaluation should be stopped, and the label predicted. By making a decision after 
each evaluation we are able to early stop the evaluation of features on examples with a large partial 
margin after having evaluated only a few features. Examples with a large partial margin are unlikely 
to have a full margin below the required threshold. Therefore, by rejecting these examples early, 
large savings in computation are achieved. This is quite different from the budgeted approach where 
a constant smaller number of features is evaluated for all examples, in which case, all examples are 
treated equally. 

Instead of looking at the classification error we look at the stop-error. Stop-errors occur when the
algorithm stops the partial feature evaluation and predicts a label that is opposite to the label it would
have predicted if it had evaluated all the features. We demonstrate that a simple rule, comparing
each partial sum to a constant stopping threshold, can speed up a linear predictor while maintaining
generalization accuracy.
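
The stop-error definition above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`early_stop_label`, `full_label`, `stop_error_rate`) are hypothetical, each example is given as a list of already weighted feature values, and stopping is one-sided (an example is rejected as negative once a partial sum drops below the threshold `tau`).

```python
def early_stop_label(weighted_features, tau, theta=0.0):
    """Return (label, n_evaluated); stop as soon as a partial sum drops below tau."""
    partial = 0.0
    for count, value in enumerate(weighted_features, start=1):
        partial += value
        if partial < tau:
            return -1, count          # early rejection as a negative
    label = 1 if partial >= theta else -1
    return label, len(weighted_features)


def full_label(weighted_features, theta=0.0):
    """Label obtained when every feature is evaluated."""
    return 1 if sum(weighted_features) >= theta else -1


def stop_error_rate(examples, tau, theta=0.0):
    """Fraction of examples whose early-stopped label differs from the full label."""
    errors = sum(
        early_stop_label(x, tau, theta)[0] != full_label(x, theta)
        for x in examples
    )
    return errors / len(examples)


examples = [
    [-1.0, -1.0, -1.0, 0.5],  # obvious negative: stopped early, agrees with the full sum
    [-1.0, -1.0, 1.5, 1.5],   # dips below tau although the full sum is positive: a stop-error
    [0.5, 0.5, 0.5, 0.5],     # never stopped
]
rate = stop_error_rate(examples, tau=-1.5)
```

Note that a stop-error is measured against the full predictor, not against the true label, so it isolates the cost of early stopping from the classifier's own mistakes.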

This paper proposes a simple novel test based on Sequential Analysis and stopping methods for
Brownian motion to drastically improve the computational efficiency of margin based learning
algorithms. Our method accurately stops the evaluation of the margin when the result of the entire
summation is evident. Furthermore, our novel algorithm can be easily parallelized.

2 Related Work 

Margin based learning has spurred countless algorithms in many different disciplines and domains.
The most directly applicable machine learning algorithms are margin based online learning
algorithms. Many margin based online algorithms base their model update on the margin of each
example in the stream. Online algorithms such as Exponentiated Gradient [8] and Online Boosting
[10] update their respective models by using a margin based potential function. Passive online
algorithms, such as the Perceptron [13] and online passive-aggressive algorithms [6], define a margin
based filtering criterion for update, which only updates the algorithm's model if the value of the
margin falls below a defined threshold. All these algorithms fully evaluate the margin for each example,
which means that they evaluate all their features for every example. Recently Shalev-Shwartz et al.
[15, 14] proposed Pegasos, an online SVM solver. The solver is a stochastic gradient descent solver
which produces a maximum margin classifier at the end of the training process.

Figure 2: Example of the differences between Budgeted prediction and Attentive prediction.
Attentive prediction thresholds partial scores, whereas budgeted prediction thresholds the number of
features evaluated.

However, if there is a cost (such as time) assigned to a feature evaluation, we would like to design
an efficient learner which actively chooses which features it would like to evaluate. The idea
of learning with a feature budget was first introduced to the machine learning community
by Ben-David and Dichterman [1]. The authors introduced a formal framework for the analysis
of learning algorithms with restrictions on the amount of information they can extract, specifically
allowing the learner to access a fixed number of attributes, which is smaller than the entire set of
attributes. Very recently, both Cesa-Bianchi et al. [4] and Reyzin [12] studied how to efficiently learn
a linear predictor under a feature budget (see Figures 1 and 2). Also, Clarkson et al. [5] extended the
Perceptron algorithm to efficiently learn a classifier in sub-linear time.

Similar active learning algorithms were developed in the context of when to pay for a label (as
opposed to an attribute). Such active learning algorithms are presented with a set of unlabeled
examples and decide which examples' labels to query at a cost. The algorithm's task is to pay for
as few labels as possible while achieving specified accuracy and reliability rates [7, 3]. Typically,
a selective sampling active learning algorithm ignores examples that are easy
to classify, and pays for the labels of harder-to-classify examples that are close to the decision boundary.

Our work stems from connecting the underlying ideas between these two active learning domains, 
attribute querying and label querying. The main idea is that typically an algorithm should not query 
many attributes for examples that are easy to classify. The labels for such examples, in the label 
query active learning setting, are typically not queried. For such examples most of the attributes 
would agree to the classification of the example, and therefore the algorithm need not evaluate too 
many before deciding their importance. 

3 The Sequential Thresholded Sum Test 

The novel Constant Sequential Thresholded Sum Test is a test which is designed to control the rate 
of stop-errors a margin based learning algorithm makes. This section describes its adaptation. 

3.1 Mathematical Roadmap 

Our task is to find a filtering framework that would speed-up margin-based prediction algorithms 
by quickly classifying obvious examples. Quick classification is done by creating a test that stops 
the score evaluation process given the partial computation of the score. We measure the difficulty
in classifying an example by the magnitude of its score. We define $\theta$ as the prediction threshold,
where examples that are predicted as negative have a score smaller than $\theta$ and the rest are predicted
as positive. Statistically, this problem can be generalized to finding a test for early stopping the
computation of a partial sum of weighted independent random variables when the result of the full
summation is guaranteed with a high probability. Given a required decision error rate $\delta$ we will derive
the Constant Sequential Thresholded Sum Test (Constant-STST) that will provide a constant early
stopping threshold that maintains the required confidence.

Figure 3: Performance of the Brownian bridge boundary.
(a) A simulation of the Constant-STST boundary with $X_i \sim N(0.05, 1)$. The required decision error
rate is on the X-axis, and the actual is on the Y-axis. Since applying the Constant-STST results in lower
error rates than required, we observe that the boundary is conservative.
(b) Expected and actual stopping times for the Brownian bridge boundary, plotted against the number
of features n. The boundary behaves similarly to what is expected from theory. It computes in the
order of $O(\sqrt{n})$ features.

Let the sum of weighted independent random variables $(X_i, i = 1, \ldots, n)$ be defined by
$S_n = \sum_{i=1}^{n} w_i X_i$, where $w_i$ is the weight assigned to the random variable $X_i$. We require that
$w_i \in \mathbb{R}$, $X_i \in [-1, 1]$. We define $S_n$ as the full sum and $S_i$ as the partial sum. Once we have
computed the partial sum up to the $i$th random variable we know its value $S_i$. Let the stopping
threshold at coordinate $i$ be defined by $\tau_i$.

Pelossof et al. [11] previously proposed a Curved-STST by looking at the following conditional
probability

$$P(S_n > \theta \mid \text{stop before } n) = \frac{P(S_n > \theta, \text{stop before } n)}{P(\text{stop before } n)}, \qquad (1)$$

where the event "stop before n" is the event which occurs when the partial sum crosses a stopping
boundary, stop = $\{S_i < \tau_i\}$, at any point i along the stopping curve, yielding the prediction $S_n < \theta$.
The simplicity of deriving the curtailed method stems from the fact that the joint probability and the
stopping time probability need not be explicitly calculated to upper bound this conditional.
The resulting curved stopping boundary gives a constant conditional error probability throughout
the curve, which means that it is a rather conservative boundary.

However, if we are interested in controlling stop errors for a given set of examples, we are interested
in a slightly different conditional probability, $P(\text{stop before } n \mid S_n > \theta)$. Such is the case in many
classification tasks where there are significantly more negatives than positives in the dataset. This
formulation results in a more aggressive boundary which allows higher stop error rates at the
beginning of the evaluation and lower stop error rates at the end. Such a boundary stops more evaluations
early on, and fewer later on. This approach has the natural interpretation that we want to shorten
the feature evaluation for obvious negative samples, but we want to prolong the evaluations for
positive samples. A constant boundary achieves this exact "error spending" characteristic. Note that
this equation can be flipped in order to stop the evaluation of positive examples as well.



3.2 The Constant Sequential Thresholded Sum Test 

We condition the probability of making a decision error in the following way:

$$P(\text{stop before } n \mid S_n > \theta) = \frac{P(\text{stop before } n, S_n > \theta)}{P(S_n > \theta)}. \qquad (2)$$

We stated in equation (2) a conditional probability which is conditioned on the examples of
interest. Therefore, in this case we are interested in limiting the stop error rate for examples that
are important. To upper bound this conditional we will make an approximation that will allow us
to apply boundary-crossing probability estimation for a Brownian bridge. To apply the Brownian
bridge to our conditional probability we note that

$$P(\text{stop before } n \mid S_n > \theta) = P(\min_i S_i < \tau \mid S_n > \theta) \approx P(\min_i S_i < \tau \mid S_n = \theta), \qquad (3)$$

where $\tau < \theta$ is a constant stop threshold. The last approximation holds when the event $\{S_n > \theta\}$ is
rare, i.e. $E X_i < 0$, and n is large, so that the event is concentrated on $S_n$ being close to $\theta$. Now we
can approximate the stop error rate by calculating the corresponding boundary crossing probability.

Lemma 1. Let $T_\tau = \inf\{i : S_i > \tau\}$ be the first crossing time of the random walk over the constant $\tau$.
Then the probability of the following decision error, $P(T_\tau < n \mid S_n = \theta)$, is approximately equal to
$\exp\left(-\frac{2\tau(\tau - \theta)}{\mathrm{var}(S_n)}\right)$ when n is large.

Proof. By the Functional Central Limit Theorem, we know that $S_{[tn]}/\sqrt{n}$, $t \in [0, 1]$, converges
to the Brownian motion process. The conclusion of the lemma then follows from Lemma 2 in the
Appendix. □



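Theorem 1. (Suppose that $\theta = 0$.) Choose $\tau = \sqrt{\mathrm{var}(S_n)} \sqrt{\log \delta^{-1/2}}$ for the constant boundary.
Then the rate for the decision error $\{T_\tau < n \mid S_n < 0\}$ is approximately $\delta$.

Proof. The theorem follows from Lemma 1 directly by plugging in $\theta = 0$ and
$\tau = \sqrt{\mathrm{var}(S_n)} \sqrt{\log \delta^{-1/2}}$, which gives $\exp(-2\tau^2 / \mathrm{var}(S_n)) = \exp(-\log \delta^{-1}) = \delta$. □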

When using this boundary for prediction, we can directly see its implication for the error rate of the
classifier, since the error rate of an attentive predictor is at most the error rate of the full
predictor plus the stop error rate.
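
As a rough illustration of Theorem 1, the following sketch computes the constant threshold from a target stop error rate and checks that substituting it into the Lemma 1 approximation recovers that rate. The function names are hypothetical, and `variance_of_sum` assumes independent features with known variances, which is an assumption of this sketch rather than a procedure stated in the text.

```python
import math


def variance_of_sum(weights, feature_variances):
    """var(S_n) for independent X_i: the sum of w_i^2 * var(X_i)."""
    return sum(w * w * v for w, v in zip(weights, feature_variances))


def constant_stst_threshold(var_sn, delta):
    """tau = sqrt(var(S_n) * log(delta^(-1/2))), the Theorem 1 choice with theta = 0."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(var_sn * math.log(delta ** -0.5))


def bridge_crossing_probability(tau, var_sn, theta=0.0):
    """Lemma 1 approximation: exp(-2 * tau * (tau - theta) / var(S_n))."""
    return math.exp(-2.0 * tau * (tau - theta) / var_sn)


# Hypothetical setting: 100 unit-weight features, each with variance 0.25.
var_sn = variance_of_sum([1.0] * 100, [0.25] * 100)
tau = constant_stst_threshold(var_sn, delta=0.05)
recovered = bridge_crossing_probability(tau, var_sn)   # should match delta
```

Because the threshold grows with $\sqrt{\mathrm{var}(S_n)}$, a smaller target rate $\delta$ only increases it through the slowly growing factor $\sqrt{\log \delta^{-1/2}}$.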



3.3 Average Stopping Time for the Curved and Constant-STST 

We now show that the expected number of features evaluated for the Curved and the Constant STST
boundaries is in the order of $O(\sqrt{n \log \delta^{-1/2}})$. This is obtained by limiting the range of values $X_i$
can take. Without loss of generality, the theorem is proved for the case where we stop the
computation of positive predictions early; it can also be applied to the mirrored case of stopping
negative predictions early.

Theorem 2. Suppose that the random variables $X_i$ are bounded, i.e. $|X_i| < k$ for a constant k, and
that $E X_i > 0$. Let the stopping time be defined by $T = \inf\{i : S_i > \sqrt{\mathrm{var}(S_n) \log \delta^{-1/2}}\}$. Then
the expected stopping time is of order $O(\sqrt{n \log \delta^{-1/2}})$.

Proof.

$$E S_T = E S_{T-1} + E X_T \le E S_{T-1} + k \le \sqrt{\mathrm{var}(S_n) \log \delta^{-1/2}} + k. \qquad (4)$$

The second inequality holds since the random walk crosses the boundary for the first time
at time $T$ and therefore was under the boundary at time $T - 1$. By applying Wald's Equation,
$E S_T = E T \cdot E X$ [17], we get

$$E T = \frac{E S_T}{E X} \le \frac{\sqrt{c\, n \log \delta^{-1/2}} + k}{E X} = O(\sqrt{n \log \delta^{-1/2}}), \qquad (5)$$

where $\mathrm{var}(S_n) = c\, n$, and c, k, and $E X$ are constants. See Figure 3(b) for simulation results. □
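
The bound in equation (5) can be checked numerically with a small simulation. The sketch below is only illustrative and rests on assumptions not taken from the experiments: unit weights, independent steps of +1 or -1 (so k = 1), and a hypothetical function name `mean_stopping_time`. It compares the empirical mean stopping time with the Wald-type bound $(\text{boundary} + k)/E X$.

```python
import math
import random


def mean_stopping_time(n, p_up, delta, trials, seed=0):
    """Average first time a +1/-1 random walk exceeds the constant boundary."""
    rng = random.Random(seed)
    drift = 2.0 * p_up - 1.0                  # E[X_i] for steps of +1 / -1
    var_sn = n * (1.0 - drift * drift)        # var(S_n) with unit weights
    boundary = math.sqrt(var_sn * math.log(delta ** -0.5))
    total = 0
    for _ in range(trials):
        s, t = 0.0, 0
        # Keep evaluating features until the partial sum exceeds the boundary
        # or all n features have been used.
        while t < n and s <= boundary:
            s += 1.0 if rng.random() < p_up else -1.0
            t += 1
        total += t
    wald_bound = (boundary + 1.0) / drift     # (boundary + k) / E[X], with k = 1
    return total / trials, wald_bound


mean_t, bound = mean_stopping_time(n=10_000, p_up=0.55, delta=0.1, trials=200)
```

In this setting the boundary is of order $\sqrt{n}$, so the walk is expected to stop after far fewer than n steps; the empirical mean fluctuates around the bound because of sampling noise.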



4 Experiments with Attentive Prediction 

We apply the above stopping rule to linear prediction. For the different classification tasks 'Gisette',
'MNIST 2 vs. 5', and 'Real-sim' we treat the data in the same way. We split the data into training and
test sets, train an SVM classifier on the training set, and predict on the held-out test set. We calculate
the mean kernel value for each support vector on the positive test set, $\mu_i = E_l K(X_i, X_l)$ (l is an
index over positive examples, and i is an index over support vectors), and store these as constants. When
predicting we evaluate the corrected score as $S_t = \sum_{i=1}^{t} \alpha_i (K(X_i, X) - \mu_i)$, which removes the
trend from the positive set. In our experiments, to induce independence between the support vectors,
we randomly permute the order of the support vectors and then calculate the partial scores. In
experiments where we sorted the support vectors by $|\alpha_i|$ we observed unstable results.
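
The centering and permutation steps described above might look as follows. This is a minimal sketch with hypothetical names (`mean_kernel_values`, `centered_partial_scores`) and toy numbers; kernel values are assumed to be precomputed and passed in as plain lists.

```python
import random
from itertools import accumulate


def mean_kernel_values(kernel_matrix):
    """kernel_matrix[i][l] = K(X_i, X_l) over positive examples l; returns mu_i."""
    return [sum(row) / len(row) for row in kernel_matrix]


def centered_partial_scores(kernel_values, alphas, mus, seed=None):
    """Partial scores S_t of the centered terms, in a random support-vector order."""
    order = list(range(len(alphas)))
    random.Random(seed).shuffle(order)        # random permutation of support vectors
    terms = [alphas[i] * (kernel_values[i] - mus[i]) for i in order]
    return list(accumulate(terms))            # S_1, S_2, ..., S_n


# Toy data: 3 support vectors, 2 positive examples.
kernel_matrix = [[1.0, 0.5], [0.2, 0.4], [0.9, 0.7]]
mus = mean_kernel_values(kernel_matrix)
alphas = [0.5, -1.0, 2.0]
kernel_values = [1.0, 0.1, 0.6]               # K(X_i, X) for one test example X
scores = centered_partial_scores(kernel_values, alphas, mus, seed=1)
```

The permutation changes the path of the partial scores but not the final value, so the full centered score is the same for every ordering.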

We conducted three experiments to test the loss of the Attentive approach versus a Budgeted
approach and the Full classifier. We tested the algorithm on the Gisette dataset, which includes 6,000
training examples, 1,000 test examples, and 5,000 features. An SVM was trained on this dataset
with a linear kernel and C = 1, resulting in a model with 1084 support vectors. Our second dataset
was MNIST digits 2 vs. 5, which comprises 28x28 images of the number 2 and the number 5.
The training set included 11,379 images, and the test set 1,924. We trained an SVM with an RBF
kernel with $\sigma = 0.1$, C = 1 and obtained a model with 781 support vectors. Finally we trained
an SVM on the Real-sim dataset, which comprises 72,309 examples and 20,958 dimensions.
The data was split into 2,000 randomly chosen test examples, and the rest were used for training. For
training we used a linear kernel with C = 1. The results of the full training can be seen in Figure 4,
top row, in red. We obtained highly accurate classifiers. To compare Attentive prediction with
Budgeted prediction we thresholded the partial scores at a certain threshold, and obtained a confusion
matrix for that threshold, as well as the average number of support vectors evaluated. We then ran
Budgeted prediction on the exact same set with a budget set to the average number of support vectors
evaluated by the Attentive predictor and obtained a comparable confusion matrix. Since the bounds,
as seen in the simulation, are not tight, we tested all possible thresholds, setting $\tau \in \{\min_i S_i, \ldots, 0\}$
for the entire dataset. We also set the prediction threshold $\theta = 0$. We tested for each example in the
test set AttentivePredict($K(\cdot, X) - \mu$, $\alpha$, $\tau$), where $K(\cdot, X) - \mu$ is the centered kernel evaluation,
and $\alpha$ is a vector of weights obtained by the SVM. Similarly we tested Budgeted prediction, with a
budget equal to the average number of features calculated by AttentivePredict over the entire test set,
BudgetedPredict($K(\cdot, X) - \mu$, $\alpha$, $b$).

Algorithm 1 Attentive Prediction

AttentivePredict(X, w, $\tau$)
    if $\exists i : \sum_{j=1}^{i} w_j X_j < \tau$ then
        return $\tau$
    else
        return $\sum_{j=1}^{n} w_j X_j$
    end if

Algorithm 2 Budgeted Prediction

BudgetedPredict(X, w, b)
    return $\sum_{i=1}^{b} w_i X_i$
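
The two algorithms above can be sketched in Python as follows. This is an illustration rather than the implementation used in the experiments: the functions additionally return the number of features evaluated, `matched_budget` is a hypothetical helper, and rounding the average up to an integer budget is an assumption of this sketch.

```python
import math


def attentive_predict(x, w, tau):
    """Algorithm 1: return (score, n_evaluated); the score is tau if stopped early."""
    partial = 0.0
    for i, (w_i, x_i) in enumerate(zip(w, x), start=1):
        partial += w_i * x_i
        if partial < tau:
            return tau, i                     # some partial sum fell below tau
    return partial, len(w)                    # no early stop: full weighted sum


def budgeted_predict(x, w, b):
    """Algorithm 2: weighted sum of the first b features only."""
    score = sum(w_i * x_i for w_i, x_i in zip(w[:b], x[:b]))
    return score, min(b, len(w))


def matched_budget(examples, w, tau):
    """Budget equal to the average number of features AttentivePredict evaluates."""
    counts = [attentive_predict(x, w, tau)[1] for x in examples]
    return math.ceil(sum(counts) / len(counts))


w = [1.0, 1.0, 1.0, 1.0]
easy = [-1.0, -1.0, -1.0, -1.0]               # features agree: rejected after 2 evaluations
hard = [1.0, -1.0, 1.0, -1.0]                 # features disagree: all 4 are evaluated
tau = -1.5
budget = matched_budget([easy, hard], w, tau)
```

With this toy data the attentive predictor spends two evaluations on the easy example and four on the hard one, while the budgeted predictor spends the same matched budget on both.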



The results of the early stopping algorithms can be seen in Figure 4. The top row shows the
precision-recall curves for the different algorithms over the three sets. AttentivePredict outperforms
BudgetedPredict on all tasks. This is probably because the attentive threshold is only applied
as a one-sided test; it is therefore likely to reject mostly negative examples and classify them as
negatives, thereby lowering the false positive rate of the classifier. Negative examples that may
have been predicted by full feature evaluation as positive now have a chance of being rejected at an
interim point. Furthermore, since the distribution of partial scores for the positives is typically above
that of the negatives, one can set up an attentive threshold that positive examples would never reach
and negative examples might hit, thereby improving the FP rate at no expense. From the bottom
part of the figure we can see that the Attentive predictor indeed produces lower stop error rates than
the Budgeted predictor in the range below a 30% stop error rate. Beyond that, the Attentive
threshold gets very close to the positive distribution at the beginning of the feature evaluation, and
is too aggressive. A more conservative attentive predictor can be used in this range. We observe that
the attentive predictor can significantly lower the number of SVs evaluated (up to 50% fewer) without
much loss in predictive power.

Figure 4: Comparison of Attentive Prediction with Budgeted Prediction. Top, precision-recall for
the actual performance of the end classifier. Bottom, computation/stop-error tradeoff. Attentive
prediction outperforms Budgeted prediction in all classification tasks. Since the Attentive predictor
thresholds negatives at a constant, these predictions are void of magnitude and there is a steep
fall-off in the precision-recall plot at that point. In terms of the stop-error/computation tradeoff, Attentive
prediction requires less computation than Budgeted prediction if the required error rate is less than
30%.



4.1 Conclusions 



We sped up prediction algorithms by up to 50% without significant loss in predictive accuracy. We
proved that, under independence assumptions on the weak hypotheses, the expected number of features
evaluated is $O(\sqrt{n \log \delta^{-1/2}})$, where n is the number of features used by the learner for discrimination.

As a future direction we wish to study the performance of such algorithms when the independence 
assumptions are broken, by sorting data by feature weight or sampling according to other measures 
of importance. 

The thresholding process creates a natural attention mechanism for linear predictors. Examples
that are easy to classify (such as background) are filtered quickly without evaluating many of their
features. For examples that are hard to classify, on the other hand, the majority of their features are
evaluated. By spending little computation on easy examples and a lot of computation on hard
"interesting" examples, the Attentive algorithms exhibit a stochastic focus-of-attention mechanism.
5 Appendix



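Lemma 2. Let $S_u$, $u \ge 0$, be a continuous time Brownian motion process and $T_\tau = \inf\{u : S_u > \tau\}$.
Then, for $\theta < \tau$ and $t > 0$,

$$P(T_\tau < t \mid S_t = \theta) = \exp\left\{-\frac{2\tau(\tau - \theta)}{\mathrm{var}(S_t)}\right\}. \qquad (6)$$

Proof. We can look at an infinitesimally small area $d\theta$ around $\theta$. Then, by definition of conditional
probability,

$$P(T_\tau < t \mid S_t = \theta) = \frac{P(T_\tau < t, S_t \in d\theta)}{P(S_t \in d\theta)}, \qquad (7)$$

where $S_t \in d\theta$ denotes $S_t \in [\theta, \theta + d\theta)$. For the numerator, we have, by the reflection principle,

$$P(T_\tau < t, S_t \in d\theta) = P(T_\tau < t, S_t \in 2\tau - d\theta) = P(S_t \in 2\tau - d\theta) = \frac{1}{\sqrt{\mathrm{var}(S_t)}} \phi\left(\frac{2\tau - \theta}{\sqrt{\mathrm{var}(S_t)}}\right) d\theta,$$

where $\phi$ is the standard normal density function. For the denominator we know that

$$P(S_t \in d\theta) = \frac{1}{\sqrt{\mathrm{var}(S_t)}} \phi\left(\frac{\theta}{\sqrt{\mathrm{var}(S_t)}}\right) d\theta.$$

Plugging the preceding two equalities back into (7) we get

$$P(T_\tau < t \mid S_t = \theta) = \frac{\phi\left(\frac{2\tau - \theta}{\sqrt{\mathrm{var}(S_t)}}\right)}{\phi\left(\frac{\theta}{\sqrt{\mathrm{var}(S_t)}}\right)} = \exp\left\{-\frac{1}{2} \frac{(2\tau - \theta)^2}{\mathrm{var}(S_t)} + \frac{1}{2} \frac{\theta^2}{\mathrm{var}(S_t)}\right\} = \exp\left\{-\frac{2\tau(\tau - \theta)}{\mathrm{var}(S_t)}\right\}.$$

□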
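
The closed form in equation (6) can be compared against a Monte Carlo estimate. The sketch below is a rough check under stated assumptions: the Brownian motion is approximated by a Gaussian random walk with total variance 1, the endpoint is pinned at $\theta$ by the standard bridge construction, and the function name `bridge_crossing_estimate` is hypothetical. Because the path is only observed at discrete steps, the estimate tends to fall slightly below the continuous-time value.

```python
import math
import random


def bridge_crossing_estimate(tau, theta, n_steps, trials, seed=0):
    """Estimate P(max of path > tau | endpoint = theta) for a unit-variance walk."""
    rng = random.Random(seed)
    step_sd = 1.0 / math.sqrt(n_steps)        # so that var(S_t) = 1 at the endpoint
    crossings = 0
    for _ in range(trials):
        walk = [0.0]
        for _ in range(n_steps):
            walk.append(walk[-1] + rng.gauss(0.0, step_sd))
        correction = walk[-1] - theta
        # Pin the endpoint at theta: B_i = W_i - (i / n) * (W_n - theta).
        if any(
            walk[i] - (i / n_steps) * correction > tau
            for i in range(1, n_steps + 1)
        ):
            crossings += 1
    return crossings / trials


estimate = bridge_crossing_estimate(tau=0.8, theta=0.0, n_steps=400, trials=1000)
closed_form = math.exp(-2.0 * 0.8 * (0.8 - 0.0) / 1.0)   # equation (6) with var(S_t) = 1
```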